ABSTRACT

The widespread use of location-aware devices has led to countless
location-based services in which a user query can be arbitrarily complex,
i.e., one that embeds multiple spatial selection and join predicates.
Amongst these predicates, the k-Nearest-Neighbor (kNN) predicate stands as
one of the most important and widely used predicates. Unlike related
research, this paper goes beyond the optimization of queries with single
kNN predicates, and shows how queries with two kNN predicates can be
optimized. In particular, the paper addresses the optimization of queries
with: (i) two kNN-select predicates, (ii) two kNN-join predicates, and
(iii) one kNN-join predicate and one kNN-select predicate. For each type of
query, conceptually correct query evaluation plans (QEPs) and new
algorithms that optimize the query execution time are presented.
Experimental results demonstrate that the proposed algorithms outperform
the conceptually correct QEPs by orders of magnitude.

1. INTRODUCTION 

Many emerging applications of location-based services demand complex
location-based queries. These queries can contain multiple predicates that
involve a combination of spatial (e.g., kNN and range) predicates along
with the traditional selects, joins, and group-by's of relational
databases.

Although a large spectrum of research has been devoted to query 
processing of location-based queries (e.g., [12, 5, 10, 11, 21, 20, 
9, 8]), none addresses the processing and optimization of location- 
based queries that contain multiple location-based predicates. 

The key issue in queries with multiple location-based predicates is that
they can produce different results based on the order in which the
predicates are evaluated. This results in an ambiguity in the intended
semantics of these queries. In [19], we study the conceptual evaluation of
queries that include multiple similarity predicates [16]: similarity
group-by (e.g., group-around) [17], similarity join (e.g., ε-join,
kNN-join, and join-around) [18], and similarity selection (e.g.,
ε-selection and kNN-selection). In [19], we provide equivalence rules for
similarity queries in the form of algebraic transformations that focus on
the correctness of these transformations, but do not introduce any
algorithms for the efficient evaluation of similarity queries. In contrast,
this paper introduces efficient algorithms for processing queries with two
kNN predicates while retaining the correctness of their evaluation.

In this paper, we focus on two operations: kNN-select and kNN-join. While
these operations have a variety of flavors, the ones we adopt in this paper
are explained as follows. Assume that we have two sets, say $E_1$ and
$E_2$, of points in the two-dimensional space. For simplicity, we use the
Euclidean distance.

• kNN-select: For a focal point $f$, $\sigma_{k,f}(E_1)$ returns from the
set of points in $E_1$ the $k$-closest to $f$.

• kNN-join: $E_1 \bowtie_{kNN} E_2$ returns all the pairs of the form
$(e_1, e_2)$, where $e_1 \in E_1$ and $e_2 \in E_2$, and $e_2$ is among the
$k$-closest points to $e_1$.
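
The two definitions above can be illustrated with a brute-force sketch. It is not one of the algorithms of this paper; the function names `knn_select` and `knn_join` and the sample points are assumptions made for illustration, and ties are broken by input order.

```python
from math import dist

def knn_select(focal, points, k):
    """Return the k points of `points` closest to `focal` (Euclidean distance)."""
    return sorted(points, key=lambda p: dist(p, focal))[:k]

def knn_join(outer, inner, k):
    """Return all pairs (e1, e2) where e2 is among the k closest points of `inner` to e1."""
    return [(e1, e2) for e1 in outer for e2 in knn_select(e1, inner, k)]

# Hypothetical sample data on a line, chosen so that distances are easy to read off.
E1 = [(0.0, 0.0), (10.0, 0.0)]
E2 = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0)]

selected = knn_select((0.0, 0.0), E2, 2)   # the two points of E2 closest to the origin
pairs = knn_join(E1, E2, 1)                # each point of E1 paired with its nearest E2 point
```

Note that the join is asymmetric: swapping `outer` and `inner` generally yields a different set of pairs.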

Queries containing two of these operations pose significant query
processing and optimization challenges. For example, the well-known
heuristic of pushing selections below joins [4] to reduce the execution
time of a query can produce wrong results in the case of a kNN-join. This
is demonstrated through the following example.

Assume that a car breaks down during a trip. The driver needs to find a
hotel and a mechanic shop that are close to each other. At the same time,
the driver wants the hotel to be close to a specific shopping center, so
that he can shop while the car is being repaired. The driver issues the
following query: From the list of mechanic shops and the two closest hotels
to each mechanic shop, report the (mechanic shop, hotel) pairs, where the
hotel is amongst the two closest neighbors of the shopping center.

Notice that this query involves a kNN-select on the inner (right) relation
of a kNN-join. Figures 1 and 2 give two possible QEPs for the query. In
both figures, black dots represent mechanic shops, white dots represent
hotels, and the red triangle represents the shopping center. In Figure 1,
the kNN-select is performed after the kNN-join, while in Figure 2, the
kNN-select is pushed below the kNN-join. As the figures demonstrate, the
two QEPs produce different results.

According to [19], the correct QEP for such a query is the one in Figure 1.
Pushing a kNN-select below the inner relation of a kNN-join, as a standard
relational query optimizer would typically do, reduces the scope of the
points being considered in the inner relation. When the kNN-join is
performed, the outer relation will not have the entire set of points of the
inner relation to join with, and hence, the kNN-join will not be performed
correctly. For example, in Figure 2, the Mechanic relation will have
nothing to join with except Hotels $h_1$ and $h_2$. Thus, the resulting
pairs will be all the mechanic shops paired with either $h_1$ or $h_2$,
which is wrong. In other words,

$(E_1 \bowtie_{kNN} E_2) \cap (E_1 \times \sigma_{k,f}(E_2)) \neq E_1 \bowtie_{kNN} (\sigma_{k,f}(E_2))$.

Figure 1: A QEP with the kNN-select performed after the kNN-join. k = 2 in
both predicates. The resulting pairs are: $(m_1, h_1)$, $(m_2, h_1)$,
$(m_2, h_2)$, $(m_3, h_2)$, and $(m_4, h_1)$.

Figure 2: A QEP with the kNN-select pushed below the inner relation of the
kNN-join. k = 2 in both predicates. The resulting pairs are: $(m_1, h_1)$,
$(m_1, h_2)$, $(m_2, h_1)$, $(m_2, h_2)$, $(m_3, h_1)$, $(m_3, h_2)$,
$(m_4, h_1)$, and $(m_4, h_2)$.

The above example demonstrates that pushing a kNN-select below the inner
relation of a kNN-join is invalid. Because this optimization is
unavailable, new optimization techniques are needed that can still leverage
the pruning effect of selection without compromising the correctness of
evaluation.
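
The effect can be reproduced on a small data set. The following sketch uses brute-force kNN and hypothetical coordinates (they are not the points of Figures 1 and 2); it compares the plan that selects after the join with the plan that pushes the selection below the inner relation.

```python
from math import dist

def knn(p, points, k):
    """Brute-force k nearest neighbors of p among `points`."""
    return sorted(points, key=lambda q: dist(p, q))[:k]

def knn_join(outer, inner, k):
    return {(e1, e2) for e1 in outer for e2 in knn(e1, inner, k)}

# Hypothetical data: three hotels, two mechanic shops, one shopping center.
hotels = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
mechanics = [(0.0, 1.0), (9.0, 0.0)]
center = (0.5, 1.0)
k = 2

near_center = set(knn(center, hotels, k))

# Conceptually correct plan: join first, then keep pairs whose hotel passes the selection.
select_after_join = {(m, h) for (m, h) in knn_join(mechanics, hotels, k) if h in near_center}

# Invalid plan: the join only sees the hotels that survived the selection.
select_pushed_down = knn_join(mechanics, list(near_center), k)

# Pairs reported by the invalid plan that are not in the correct answer.
spurious = select_pushed_down - select_after_join
```

Here the mechanic shop at (9, 0) has the hotel at (10, 0) among its two nearest hotels. Once that hotel is filtered out before the join, the shop is wrongly paired with the hotel at (0, 0).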

In addition to the above form of interaction between kNN predicates, we
study the following cases:

• The case of a kNN-select on the outer relation of a kNN-join. This case
has been added for completeness. Pushing a selection below the outer
relation of a kNN-join produces correct query results.

• The cases of two chained and unchained kNN-joins. Since the kNN-join is
not a symmetric operation, the two expressions
$(E_1 \bowtie_{kNN} E_2) \cap (E_2 \bowtie_{kNN} E_3)$ and
$(E_1 \bowtie_{kNN} E_2) \cap (E_3 \bowtie_{kNN} E_2)$ are not equivalent.
We call the joins in the former expression chained
($E_1 \to E_2 \to E_3$), and those in the latter expression unchained.

• The case of two kNN-selects.

For each of these cases, we introduce efficient algorithms that not 
only guarantee the correctness of evaluation, but also outperform 
the corresponding conceptually correct QEPs by orders of magni- 
tude. 



More specifically, the contributions of this paper can be summa- 
rized as follows. 

1. We introduce two algorithms for evaluating a query with a kNN-select on
the inner relation of a kNN-join (Section 3).

2. We study the cases of two chained and unchained kNN-joins, and introduce
efficient algorithms for their evaluation (Section 4).

3. We study the case of two kNN-selects, and present an efficient algorithm
for its evaluation (Section 5).

4. We conduct extensive experiments that show how our proposed techniques
outperform the conceptually correct QEPs by orders of magnitude
(Section 6).

2. PRELIMINARIES 

We assume that the data consists of points in the two- 
dimensional space. The algorithms we present do not assume a 
specific indexing structure. The algorithms can be applied to a 
quadtree, an R-tree, or any of their variants (e.g., [14, 6, 2, 7]). 
The quadtree and its variants are hierarchical spatial data struc- 
tures that recursively partition the underlying space into blocks un- 
til the number of points inside a block satisfies some criterion (be- 
ing less/greater than some threshold). We assume that the index 
maintains the count of points in each block. We use a simple grid 
in the figures for illustration purposes. 

In this paper, we make extensive use of two metrics: MINDIST and MAXDIST
[13]. The MINDIST (or MAXDIST) between a point, say $p$, and a block, say
$b$, refers to the minimum (or maximum) possible distance between $p$ and
any point in $b$. In the algorithms we present, we process the blocks in a
certain order according to their MINDIST (or MAXDIST) from a certain point.
An ordering of the blocks based on the MINDIST or MAXDIST from a certain
point is termed a MINDIST or MAXDIST ordering, respectively. We use the
terms neighborhood and locality of a point [15], which are defined as
follows:

DEFINITION 1. The neighborhood of a point, say $p$, is the set of the $k$
nearest neighboring points to $p$.

DEFINITION 2. The locality of a point, say p, is a set of blocks 
inside which the neighborhood of p exists. 

One can use any algorithm to compute the neighborhood of a 
point. In this paper, we employ the locality algorithm of [15]. 
Given a point, say p, the main idea of the algorithm is to build 
the minimum locality of p, and then compute the neighborhood of 
p only from its locality. For more detail on the algorithm, the reader 
is referred to [15]. 
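
As an illustration of the MINDIST and MAXDIST metrics used in this section, the following sketch computes both for axis-aligned rectangular blocks. The block representation `(xmin, ymin, xmax, ymax)` and the sample blocks are assumptions made for the example; the paper does not prescribe a block layout.

```python
from math import hypot

def mindist(p, block):
    """Minimum possible distance between point p and any point in the block."""
    xmin, ymin, xmax, ymax = block
    dx = max(xmin - p[0], 0.0, p[0] - xmax)   # 0 when p lies within the x-extent
    dy = max(ymin - p[1], 0.0, p[1] - ymax)   # 0 when p lies within the y-extent
    return hypot(dx, dy)

def maxdist(p, block):
    """Maximum possible distance between p and any point in the block (a corner)."""
    xmin, ymin, xmax, ymax = block
    dx = max(abs(p[0] - xmin), abs(p[0] - xmax))
    dy = max(abs(p[1] - ymin), abs(p[1] - ymax))
    return hypot(dx, dy)

# Hypothetical unit blocks and a point inside the first one.
blocks = [(0, 0, 1, 1), (3, 0, 4, 1), (0, 4, 1, 5)]
p = (0.5, 0.5)

mindist_order = sorted(blocks, key=lambda b: mindist(p, b))   # a MINDIST ordering from p
maxdist_order = sorted(blocks, key=lambda b: maxdist(p, b))   # a MAXDIST ordering from p
```

A real index would typically produce these orderings incrementally (for example, with a priority queue over index nodes) instead of sorting all blocks.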

3. KNN-SELECT WITH KNN-JOIN 

As discussed in Section 1, pushing a kNN-select below the inner relation of
a kNN-join is invalid. However, pushing a kNN-select below the outer
relation of a kNN-join is valid [19], i.e.,

$(E_1 \bowtie_{kNN} E_2) \cap ((\sigma_{k,f}(E_1)) \times E_2) = (\sigma_{k,f}(E_1)) \bowtie_{kNN} E_2$.

To illustrate the above situation, consider the scenario in Section 1.
Assume that the driver issues the following query: From the list of
mechanic shops and the two closest hotels to each mechanic shop, report the
(mechanic shop, hotel) pairs where the mechanic shop is amongst the two
closest neighbors of the shopping center. Notice that in this case, the
selection is on the outer (left) relation of the join.

Figure 3 gives two different QEPs, QEP$_1$ and QEP$_2$, for the query. In
QEP$_1$, the selection is pushed below the join, while in QEP$_2$, the
selection is performed after the join. Both QEPs produce the same results.
The reason is that the pushed selection in QEP$_1$ excludes some points of
the outer relation from the join. Performing the join for these excluded
points is useless, because any join result that contains one of them would
be discarded anyway when the selection is applied at the end, as in
QEP$_2$.
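
The equivalence of the two plans can be checked on a small example. This is a brute-force sketch with hypothetical coordinates, not the data of Figure 3.

```python
from math import dist

def knn(p, points, k):
    """Brute-force k nearest neighbors of p among `points`."""
    return sorted(points, key=lambda q: dist(p, q))[:k]

def knn_join(outer, inner, k):
    return {(e1, e2) for e1 in outer for e2 in knn(e1, inner, k)}

# Hypothetical data: four mechanic shops, four hotels, one shopping center.
mechanics = [(0.0, 1.0), (2.0, 2.0), (9.0, 0.0), (5.0, 5.0)]
hotels = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (4.0, 4.0)]
center = (0.5, 1.0)
k = 2

near_center = set(knn(center, mechanics, k))   # selection on the OUTER relation

# Selection pushed below the join: only the selected shops are joined.
qep1 = knn_join(list(near_center), hotels, k)

# Selection applied after the join: every shop is joined, then filtered.
qep2 = {(m, h) for (m, h) in knn_join(mechanics, hotels, k) if m in near_center}
```

Both plans compute the neighborhood of each selected shop against the full hotel relation, which is why the results coincide while the first plan does less work.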



Figure 3: Two QEPs for a query with a kNN-select on the outer relation of a
kNN-join. k = 2 in both predicates. Both QEPs result in the same pairs:
$(m_2, h_1)$, $(m_2, h_2)$, $(m_3, h_2)$, and $(m_3, h_3)$.

The challenge in pushing a kNN-select$^1$ on the inner relation of a
kNN-join calls for new optimization techniques that can still leverage the
pruning effect of selection without compromising the correctness of
evaluation.

$^1$ Notice that the same challenge exists if the selection is a spatial
range (e.g., rectangle), or a relational attribute-based selection.

In the rest of this section, we present two algorithms, Counting and
Block-Marking, for evaluating a query with a kNN-select on the inner
relation of a kNN-join. Formally, the two algorithms evaluate a query of
the form $(E_1 \bowtie_{kNN} E_2) \cap (E_1 \times \sigma_{k_\sigma,f}(E_2))$
that retrieves the pairs $(e_1, e_2)$ such that $e_2$ is
$k_{\bowtie}$-closest to $e_1$ and $k_\sigma$-closest to $f$, where
$k_{\bowtie}$ is the k value of the join, and $k_\sigma$ is the k value of
the selection.

The two algorithms are based on the following insight. First, we compute
the neighborhood of $f$ (i.e., perform the selection). Then, for each point
$e_1 \in E_1$, if we can make sure, without computing the neighborhood of
$e_1$, that it cannot intersect the neighborhood of $f$, then we ignore
$e_1$ as it will not contribute to the results of the query. Otherwise, we
compute the neighborhood of $e_1$, and intersect it with the neighborhood
of $f$. The difference between the two algorithms is in the way they check
whether the neighborhood of $e_1$ can intersect the neighborhood of $f$.

3.1 Counting Algorithm

The Counting algorithm proceeds as follows. First, we compute the
neighborhood of $f$. Then, for each point $e_1 \in E_1$, we compute the
distance between $e_1$ and the nearest point to $e_1$ in the neighborhood
of $f$. We call this distance the search threshold. Then, we determine the
count of the points in the blocks that are completely included within the
search threshold. If the count exceeds $k_{\bowtie}$, i.e., the k value of
the join, then the neighborhood of $e_1$ cannot intersect the neighborhood
of $f$. Thus, it is useless to compute the neighborhood of $e_1$.
Otherwise, we compute the neighborhood of $e_1$, intersect it with the
neighborhood of $f$, and produce pairs of the form $(e_1, i)$, where $i$
belongs to the intersection. An illustration of the Counting algorithm is
given in Figure 4.

Figure 4: The small circle to the right confines the neighborhood of $f$ in
$E_2$. The search threshold is the distance between $e_1$ and the point
nearest to it in the neighborhood of $f$. If the count of the points of
$E_2$ in the gray blocks (i.e., blocks that are completely included within
the search threshold) exceeds $k_{\bowtie}$, point $e_1$ is ignored.

Procedure 1 gives pseudocode for the algorithm. We assume 
the existence of Method getkNN (p, k) that returns the neigh- 
borhood of a point, say p, and Method intersect (P, Q) that 
returns the set-intersection between two sets of points, say P and 
Q. We use both methods throughout the paper. 

Procedure 1 kNN-join kNN-select (Counting)

 1: nbr_f ← getkNN(f, k_σ) // Neighborhood of f
 2: outputPairs ← ∅
 3: for (e_1 ∈ E_1) do
 4:    // Get the distance from e_1 to the nearest point to it in nbr_f
 5:    searchThreshold ← distance(e_1, nbr_f.nearest)
 6:    count ← 0
 7:    maxOrder ← A MAXDIST ordering of E_2 blocks from e_1
 8:    while count < k_⋈ do
 9:       block ← maxOrder.next()
10:       if MAXDIST(block, e_1) > searchThreshold then
11:          break
12:       end if
13:       count ← count + block.numberOfPoints
14:    end while
15:    if count < k_⋈ then
16:       nbr_e1 ← getkNN(e_1, k_⋈) // Neighborhood of e_1
17:       intersection ← intersect(nbr_f, nbr_e1)
18:       for (i ∈ intersection) do
19:          outputPairs.add(e_1, i)
20:       end for
21:    end if
22: end for
23: return outputPairs



To determine the count of points in the blocks of $E_2$ that are completely
included within the search threshold, we scan the blocks of the index of
$E_2$ in increasing order of their MAXDIST from $e_1$. We keep accumulating
the count of the points in the encountered blocks. As mentioned in
Section 2, we assume that the index stores the count of the points in each
block. Once a block, say $B_M$, having its MAXDIST greater than the search
threshold is encountered, we stop (see Line 11). The reason is that $B_M$
and the ones to follow are not completely included within the search
threshold. Also, we stop if the number of points in the encountered blocks
exceeds $k_{\bowtie}$ (see Line 8). In this case, processing more blocks
would result in a count that is also greater than $k_{\bowtie}$.
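
The steps of the Counting algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the index is a hypothetical uniform grid with cell side `CELL`, neighborhoods are computed by brute force instead of the locality algorithm of [15], the MAXDIST ordering is obtained by sorting all blocks, and the names `counting_join_select`, `k_join`, and `k_sel` are illustrative. The stopping test follows the pseudocode of Procedure 1 (`count < k_join`).

```python
from collections import Counter
from math import dist, floor, hypot
import random

CELL = 1.0  # side length of a block in the hypothetical grid index

def knn(p, points, k):
    """Brute-force k nearest neighbors of p among `points`."""
    return sorted(points, key=lambda q: dist(p, q))[:k]

def block_of(p):
    return (floor(p[0] / CELL), floor(p[1] / CELL))

def maxdist(p, block):
    """Largest possible distance from p to any location inside the grid block."""
    x0, y0 = block[0] * CELL, block[1] * CELL
    dx = max(abs(p[0] - x0), abs(p[0] - (x0 + CELL)))
    dy = max(abs(p[1] - y0), abs(p[1] - (y0 + CELL)))
    return hypot(dx, dy)

def counting_join_select(E1, E2, f, k_join, k_sel):
    counts = Counter(block_of(p) for p in E2)       # per-block point counts of E2
    nbr_f = knn(f, E2, k_sel)                       # perform the selection once
    output, skipped = set(), 0
    for e1 in E1:
        threshold = min(dist(e1, q) for q in nbr_f)  # the search threshold of e1
        count = 0
        for block in sorted(counts, key=lambda b: maxdist(e1, b)):
            if count >= k_join or maxdist(e1, block) > threshold:
                break
            count += counts[block]                   # block lies fully within the threshold
        if count < k_join:
            output.update((e1, q) for q in knn(e1, E2, k_join) if q in nbr_f)
        else:
            skipped += 1                             # neighborhood of e1 never computed
    return output, skipped

rng = random.Random(7)
E1 = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
E2 = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(300)]
f = (5.0, 5.0)
result, skipped = counting_join_select(E1, E2, f, k_join=3, k_sel=4)
```

The value `skipped` counts the outer points whose neighborhood computation was avoided; the result can be compared against the select-after-join plan to confirm that no pair is lost.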
3.2 Block-Marking Algorithm

The Block-Marking algorithm proceeds as follows. First, we compute the
neighborhood of $f$. Then, before performing the join, we perform a
preprocessing step for all the blocks of $E_1$. For each block, we
determine whether points located inside the block can contribute to the
results of the query or not. If no point $e_1 \in E_1$ in the block can
contribute to the results of the query, we mark the entire block
Non-Contributing. Otherwise, the block is marked Contributing.

After the preprocessing step, we scan the Contributing blocks of $E_1$.
Non-Contributing blocks are ignored. For each point $e_1$ in a Contributing
block, we compute the neighborhood of $e_1$, intersect it with the
neighborhood of $f$, and produce pairs of the form $(e_1, i)$, where $i$ is
a point that belongs to the intersection. Procedure 2 gives pseudocode for
the algorithm. Line 2 calls the preprocessing step through Procedure 3
listed next.



Procedure 2 kNN-join kNN-select (Block-Marking)

 1: nbr_f ← getkNN(f, k_σ) // Neighborhood of f
 2: contributingBlocks ← preprocess(nbr_f)
 3: outputPairs ← ∅
 4: for (block ∈ contributingBlocks) do
 5:    for (e_1 ∈ block) do
 6:       nbr_e1 ← getkNN(e_1, k_⋈) // Neighborhood of e_1
 7:       intersection ← intersect(nbr_f, nbr_e1)
 8:       for (i ∈ intersection) do
 9:          outputPairs.add(e_1, i)
10:       end for
11:    end for
12: end for
13: return outputPairs



3.2.1 Efficient Preprocessing 

To determine whether a block is Contributing or not, we compute the
neighborhood of the center of the block.$^2$ Then, the distance between the
center and the farthest of its neighbors is determined, and is added to the
length of the diagonal of the block, forming a search threshold. If no
point in the neighborhood of $f$ is within the search threshold, then we
mark the entire block Non-Contributing. In this case, any point, say $p$,
in the block will have $k_{\bowtie}$ or more points that are nearer to $p$
than any point in the neighborhood of $f$.

Figure 5: A block is marked Non-Contributing if
$(r + d + f_{farthest}) < f_{center}$.

$^2$ We discuss the reason behind choosing the center of the block later in
this section.

An equivalent, yet cheaper check can be described as follows. Refer to
Figure 5 for illustration. Consider a block, say NC, e.g., the gray block
in Figure 5. Let $r$ be the distance between the center of NC and the
farthest of the neighbors of that center, $d$ be the length of the diagonal
of NC, $f_{farthest}$ be the distance between $f$ and the farthest of
$f$'s neighbors, and $f_{center}$ be the distance between $f$ and the
center of NC. NC is marked Non-Contributing if:

$(r + d + f_{farthest}) < f_{center}$.
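
The per-block check can be sketched as follows. The sketch applies the check to every non-empty block of the outer relation (the brute-force form of the preprocessing) and does not include any early termination of the scan. The uniform grid with cell side `CELL`, the brute-force `knn` helper, and the names `is_non_contributing` and `block_marking_join_select` are assumptions made for illustration.

```python
from collections import defaultdict
from math import dist, floor, sqrt
import random

CELL = 1.0                    # side length of a block in the hypothetical grid index
DIAGONAL = CELL * sqrt(2.0)   # d: length of the diagonal of a block

def knn(p, points, k):
    """Brute-force k nearest neighbors of p among `points`."""
    return sorted(points, key=lambda q: dist(p, q))[:k]

def is_non_contributing(block, E2, f, nbr_f, k_join):
    """Check r + d + f_farthest < f_center for one grid block."""
    center = ((block[0] + 0.5) * CELL, (block[1] + 0.5) * CELL)
    r = dist(center, knn(center, E2, k_join)[-1])   # farthest neighbor of the center
    f_farthest = max(dist(f, q) for q in nbr_f)
    f_center = dist(f, center)
    return r + DIAGONAL + f_farthest < f_center

def block_marking_join_select(E1, E2, f, k_join, k_sel):
    nbr_f = knn(f, E2, k_sel)
    blocks = defaultdict(list)                      # group the outer points by block
    for e1 in E1:
        blocks[(floor(e1[0] / CELL), floor(e1[1] / CELL))].append(e1)
    output, pruned = set(), 0
    for block, members in blocks.items():
        if is_non_contributing(block, E2, f, nbr_f, k_join):
            pruned += 1                             # the whole block is skipped
            continue
        for e1 in members:
            output.update((e1, q) for q in knn(e1, E2, k_join) if q in nbr_f)
    return output, pruned

rng = random.Random(11)
E1 = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
E2 = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(300)]
f = (5.0, 5.0)
result, pruned = block_marking_join_select(E1, E2, f, k_join=3, k_sel=4)
```

The overhead here is one neighborhood computation per block instead of one counting pass per point, which is the trade-off between the two algorithms.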

A brute-force approach for the preprocessing phase is to scan each block in
$E_1$, compute the neighborhood of its center, and perform the check
described above to determine whether the block is Contributing or not. A
more efficient approach is described below.

We scan the blocks of $E_1$ in MINDIST order from $f$. When a block, say
NC, is marked Non-Contributing, the MAXDIST, say $M$, between NC and $f$ is
determined. If all the subsequently encountered blocks are also marked
Non-Contributing, then we stop scanning when we encounter a block of
MINDIST at least $M$. Otherwise, if any of the next encountered blocks is
not marked Non-Contributing, then this cycle is repeated. The idea of this
approach is to determine a contour (complete cycle) of blocks such that all
the blocks in the contour are Non-Contributing. All the blocks outside that
contour are considered Non-Contributing without further processing. This is
illustrated in Figure 6. Procedure 3 gives pseudocode for the preprocessing
phase.
Figure 6: The preprocessing phase. The green block is a Non-Contributing
block. All the next scanned blocks are also Non-Contributing (the contour
of gray blocks). Processing stops when the red block is encountered, since
its MINDIST from $f$ equals the MAXDIST of the green block from $f$. All
the next blocks (outside the gray contour) are considered Non-Contributing
without further processing.

3.2.2 Why Choose the Center of the Block? 

An important question to address is: If we choose any location, 
say c, other than the center of the block, will this result in a tighter 
(smaller) search threshold without falsely marking the block Non- 
Contributing? 

THEOREM 1. The search threshold is minimum if c is the cen- 
ter of the block. 

PROOF. The search threshold is determined by: 

1. the distance between c and the farthest of its neighbors, and 

2. an added distance, say x, that is the length of the diagonal of 
the block in case c is the center of the block. 

The purpose of the added distance x is to cover the neighborhood 
of any point in the block, i.e., guarantee that the neighborhood of 
any point in the block does not intersect the neighborhood of /. 
Procedure 3 Preprocess Blocks (Block-Marking)

Terms: nbr_f: The neighborhood of f. M: MAXDIST between f and the first
Non-Contributing block encountered in the cycle (e.g., the green block in
Figure 6).

 1: // f_farthest is the distance between f and the farthest of its neighbors
 2: f_farthest ← distance(f, nbr_f.farthest)
 3: contributingBlocks ← ∅
 4: M ← ∞
 5: minOrder ← A MINDIST ordering of E_1 blocks from f
 6: for (block ∈ minOrder) do
 7:    if (block.MINDIST(f) > M) then
 8:       break // All the remaining blocks are Non-Contributing
 9:    end if
10:    nbr ← getkNN(block.center, k_⋈) // Neighborhood of center
11:    // r is the distance between center and the farthest of its neighbors
12:    r ← distance(block.center, nbr.farthest)
13:    f_center ← distance(block.center, f)
14:    if (r + block.diagonal + f_farthest < f_center) then
15:       // Non-Contributing block
16:       if (M = ∞) then
17:          // First Non-Contributing block in the cycle
18:          M ← block.MAXDIST(f)
19:       end if
20:    else
21:       contributingBlocks.add(block)
22:       M ← ∞ // Start another cycle
23:    end if
24: end for
25: return contributingBlocks



Assume that we randomly select the location of $c$, and compute its
neighborhood. Refer to Figure 7 for illustration. The farthest location to
$c$ in the block is the top-left corner of the block, say $t$, with
$\overline{ct} = y$.$^3$ Point $a$ is the farthest point to $c$ in its
neighborhood, with $\overline{ac} = r$. Point $b$ is the nearest point to
$t$ in the neighborhood of $f$. The region bounded by the search threshold
does not intersect the neighborhood of $f$ as shown.




Figure 7: The effect of choosing any point other than the center 
of the block to compute the neighborhood for. x = 2y is a 
tight lower bound for the added distance x that guarantees the 
correct coverage of the search threshold. 

Observe that in Figure 7, we illustrate a bounding case in which the three
positions $a$, $b$, and $c$ are collinear and are on the diagonal of the
block (or its extension). Point $t$ is in the middle of the distance
between Points $a$ and $b$, i.e.,
$\overline{ta} = \overline{tb} = (y + r)$. Any point inside the block other
than $t$ will have a distance to Point $a$ that is $< (y + r)$, and a
distance to Point $b$ that is $> (y + r)$.

$^3$ To refer to the distance between two points, say $p_1$ and $p_2$, we
use the notation $\overline{p_1 p_2}$.



If $x > 2y$ then $\overline{tb} > (y + r)$. For any point inside the block,
the distance to Point $a$ will be $< (y + r)$, and the distance to Point
$b$ will be $> (y + r)$. This means that the neighborhood of any point in
the block cannot intersect with the neighborhood of $f$, i.e., the block is
correctly marked Non-Contributing.

If $x < 2y$ then $\overline{tb} < (y + r)$. For Point $t$, Point $b$ will
be nearer than Point $a$. So, even though no point in the neighborhood of
$f$ is within the search threshold, the neighborhood of a point at the
top-left corner will intersect the neighborhood of $f$, i.e., the block is
falsely marked Non-Contributing.

Thus, $x = 2y$ is a tight lower bound for the added distance $x$. And since
$y$ is the distance from $c$ to the farthest corner of the block, $y$ is
minimum if $c$ is the center of the block. For this reason, the search
threshold is minimum if $c$ is the center of the block. □
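
The last step of the proof, that $y$ is smallest when $c$ is the center, can be checked numerically. The following sketch evaluates $y$ (the distance from a location $c$ to the farthest corner) over a grid of candidate locations inside a hypothetical 4 x 3 block. It illustrates the claim on one block and does not replace the proof.

```python
from math import dist

def farthest_corner_distance(c, block):
    """y: distance from location c to the farthest corner of the block."""
    xmin, ymin, xmax, ymax = block
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    return max(dist(c, corner) for corner in corners)

block = (0.0, 0.0, 4.0, 3.0)   # hypothetical 4 x 3 block, diagonal d = 5
center = (2.0, 1.5)

# Candidate locations c on a 0.5-spaced grid covering the block, including the center.
candidates = [(x * 0.5, y * 0.5) for x in range(9) for y in range(7)]

best = min(candidates, key=lambda c: farthest_corner_distance(c, block))
added = 2 * farthest_corner_distance(center, block)   # x = 2y evaluated at the center
```

At the center, $y$ is half the diagonal, so the added distance $x = 2y$ equals the diagonal $d$, which is the term used in the search threshold.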

3.3 Counting vs. Block-Marking 

An important question to address is: How do we choose be- 
tween the Counting and Block-Marking algorithms? Observe that 
the Counting algorithm does not require a preprocessing phase, i.e., 
once the query is issued, points of the outer relation are processed. 
However, the Block-Marking algorithm requires a preprocessing 
phase to determine the Contributing and Non-Contributing blocks. 
Although this is an advantage for the Counting algorithm, the
Block-Marking algorithm often has better opportunities for being faster.

In the Counting algorithm, for every point in the outer relation, 
the number of points in the blocks that are within the search thresh- 
old has to be determined. In other words, the Counting algorithm 
poses a per-tuple overhead. On the other hand, the Block-Marking 
algorithm has a per-block overhead (to determine the Contributing 
blocks). Furthermore, as discussed in Section 3.2.1, this per-block 
overhead does not affect all the blocks of the outer relation. The 
reason is that the preprocessing phase stops when a contour of Non- 
Contributing blocks is encountered. 

As we illustrate in Section 6, when the number of points in the 
outer relation is small, the Counting algorithm has better perfor- 
mance. In this case, because the density of the points is relatively 
low, the overhead of the preprocessing phase of the Block-Marking 
algorithm is relatively high as it requires computing the neighbor- 
hood of the centers of many blocks without significant payoff. On 
the other hand, when the number of points in the outer relation is 
relatively high, i.e., high density, the Block-Marking algorithm has 
better performance because entire blocks will be excluded from the 
join. On the contrary, the Counting algorithm will have to process 
every point. 

4. TWO KNN-JOINS

As mentioned in Section 1, the kNN-join is not a symmetric operation, i.e.,
the two expressions $(E_1 \bowtie_{kNN} E_2) \cap (E_2 \bowtie_{kNN} E_3)$
and $(E_1 \bowtie_{kNN} E_2) \cap (E_3 \bowtie_{kNN} E_2)$ are not
equivalent. We call the joins in the former expression chained
($E_1 \to E_2 \to E_3$), and the joins in the latter expression unchained.

4.1 Unchained kNN-Joins

Consider a query on three data sets, say $A$, $B$, and $C$. The query is to
retrieve the triplets $(a, b, c)$, where $a \in A$, $b \in B$, and
$c \in C$, such that $b$ is a $k_{A-B}$ nearest neighbor of $a$, and $b$ is
a $k_{C-B}$ nearest neighbor of $c$. Figures 8 and 9 give two possible QEPs
for the query. In both figures, solid lines indicate the kNN-join performed
first, and dashed lines indicate the kNN-join performed at the end.

Although both QEPs seem to be legitimate, they produce different results;
surprisingly, neither of them is correct. The reason is that if either join
is performed first, then it filters out the input of the inner relation of
the other join. For example, in Figure 8, when $(A \bowtie_{kNN} B)$ is
performed first, point $b_3$ is filtered out and will not be in the
neighborhood of any point $c \in C$. Similarly, in Figure 9, when
$(C \bowtie_{kNN} B)$ is performed first, point $b_1$ is filtered out and
will not be in the neighborhood of any point $a \in A$. Each QEP is
equivalent to pushing a selection on the inner relation of a kNN-join,
which has been shown to be invalid earlier in the paper.

Figure 8: $(A \bowtie_{kNN} B)$ is evaluated before $(C \bowtie_{kNN} B)$.
$k_{A-B} = k_{C-B} = 2$. The resulting triplets are: $(a_1, b_1, c_1)$,
$(a_1, b_1, c_2)$, $(a_2, b_1, c_1)$, $(a_2, b_1, c_2)$, $(a_1, b_2, c_1)$,
$(a_1, b_2, c_2)$, $(a_2, b_2, c_1)$, and $(a_2, b_2, c_2)$.

Figure 9: $(C \bowtie_{kNN} B)$ is evaluated before $(A \bowtie_{kNN} B)$.
$k_{A-B} = k_{C-B} = 2$. The resulting triplets are: $(a_1, b_3, c_1)$,
$(a_1, b_3, c_2)$, $(a_2, b_3, c_1)$, $(a_2, b_3, c_2)$, $(a_1, b_2, c_1)$,
$(a_1, b_2, c_2)$, $(a_2, b_2, c_1)$, and $(a_2, b_2, c_2)$.

Figure 10: The two joins $(C \bowtie_{kNN} B)$ and $(A \bowtie_{kNN} B)$
are evaluated independently. $k_{A-B} = k_{C-B} = 2$. The resulting
triplets are: $(a_1, b_2, c_1)$, $(a_1, b_2, c_2)$, $(a_2, b_2, c_1)$, and
$(a_2, b_2, c_2)$.

According to [19], to evaluate a query with two unchained kNN-joins, each
join has to be evaluated independently. The results of the two joins are
combined using some operation that has the same flavor as intersection.
This operation takes as input the two sets of pairs of the outputs of the
two joins, and returns the matching pairs that have the same $B$ component,
i.e., intersects the two sets of pairs on $B$, which we denote by
$\cap_B$. This is illustrated in the QEP in Figure 10.
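
The intersection on $B$ can be sketched with brute-force joins. The helper name `intersect_on_b` and the sample points are assumptions made for illustration; the sketch also evaluates a plan that lets the first join filter $B$ before the second join, to show how the result changes.

```python
from math import dist

def knn(p, points, k):
    """Brute-force k nearest neighbors of p among `points`."""
    return sorted(points, key=lambda q: dist(p, q))[:k]

def knn_join(outer, inner, k):
    return [(e, b) for e in outer for b in knn(e, inner, k)]

def intersect_on_b(ab_pairs, cb_pairs):
    """Combine (a, b) and (c, b) pairs that share the same b into (a, b, c) triplets."""
    c_by_b = {}
    for c, b in cb_pairs:
        c_by_b.setdefault(b, []).append(c)
    return {(a, b, c) for a, b in ab_pairs for c in c_by_b.get(b, [])}

# Hypothetical data on a line: one point in A, three in B, one in C.
A = [(0.0, 0.0)]
B = [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
C = [(10.0, 0.0)]

ab_pairs = knn_join(A, B, 2)

# Conceptually correct plan: both joins see all of B, then intersect on B.
triplets = intersect_on_b(ab_pairs, knn_join(C, B, 2))

# Incorrect plan: the second join only sees the B points that survived the first join.
b_after_ab = sorted({b for _, b in ab_pairs})
filtered_first = intersect_on_b(ab_pairs, knn_join(C, b_after_ab, 2))
```

In this data set only the point (5, 0) of $B$ is among the two nearest neighbors of both the $A$ point and the $C$ point, so the correct plan yields a single triplet while the filtered plan yields two.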

4.1.1 Efficient Evaluation 

Consider the QEP in Figure 10 for evaluating unchained kNN-joins. Notice
that because the two joins are evaluated independently, we can start with
either join. Without loss of generality, assume that the execution starts
by evaluating the join $(A \bowtie_{kNN} B)$. We study the issue of
choosing the optimal join order later in this section. This QEP is
efficient if every point $c \in C$ is part of the final results of the
query. As we show next, if some points in $C$ do not contribute to the
results of the query, computing their neighborhood is redundant, and can be
avoided without losing the correctness of evaluation. This is illustrated
in Figure 11 that shows the distribution of the data sets $A$, $B$, and
$C$.




Figure 11: For points in Circle L, the join $(C \bowtie_{kNN} B)$ is
redundant and its computation can be avoided.

In Figure 11, points of Set $A$ are in Circle Z, points of Set $B$ are
divided between Circles X and M, and points of Set $C$ are divided between
Circles Y and L. The points in Circle M confine the neighborhood of the
points in Circle L. The points in Circle X confine the neighborhood of the
points in Circle Y. The points in Circle X also confine the neighborhood of
points in Circle Z. For all the points in Circle L, performing the join
$(C \bowtie_{kNN} B)$ is redundant because its result will never intersect
the result of the join $(A \bowtie_{kNN} B)$, as the join result of the
latter is fully contained in Circle X. On the other hand, for the points in
Circle Y, performing the join $(C \bowtie_{kNN} B)$ is essential, because
its result will be in Circle X that also contains the result of the join
$(A \bowtie_{kNN} B)$.

To efficiently evaluate a query with two unchained kNN-joins
$(A \bowtie_{kNN} B)$ and $(C \bowtie_{kNN} B)$, we proceed as follows.
After evaluating the join $(A \bowtie_{kNN} B)$, we determine the blocks of
$B$ that contain points $b \in B$ that belong to the resulting pairs
$(a, b)$, where $a \in A$. We mark these blocks as Candidate blocks. All
the other blocks are marked as Safe blocks. For example, in Figure 11,
Circle X is a Candidate block, and Circle M is a Safe block.

Before evaluating the join $(C \bowtie_{kNN} B)$, we perform a
preprocessing step similar to the preprocessing step of the Block-Marking
technique in Section 3.2. In this preprocessing step, we scan all the
blocks of $C$ to determine the blocks that are contributing or
non-contributing to the results of the query. For each block, we compute
the neighborhood of its center. Then, the distance from the center to the
farthest point in its neighbors is determined, and is added to the length
of the diagonal of the block to form a search threshold as in Figure 12. We
mark the block Non-Contributing if all the blocks that are fully or
partially contained within the search threshold are Safe.

After the preprocessing step, we scan the Contributing blocks of $C$.
Non-Contributing blocks are ignored. For each point, say $c$, in a
Contributing block, we compute the neighborhood of $c$, and produce pairs
of the form $(c, b)$ that we intersect on $B$ (i.e., $\cap_B$) with the
computed pairs of the join $(A \bowtie_{kNN} B)$. Procedure 4 gives
pseudocode for the algorithm.

Figure 12: Gray blocks are Candidate blocks. White blocks are Safe blocks.
A block is Non-Contributing if the blocks that are fully or partially
contained within its search threshold are Safe.

Procedure 4 Unchained kNN-joins (Block-Marking)

Terms: A, B, C: The input relations of the two joins. k_{A-B}, k_{C-B}: The
k values of the joins (A ⋈_kNN B) and (C ⋈_kNN B), respectively.

 1: // Perform the join (A ⋈_kNN B)
 2: ABpairs ← kNNJoin(A, B, k_{A-B})
 3: BPointsInAB ← project(ABpairs) // Project on B
 4: // Determine Candidate blocks of C (a block is Safe by default)
 5: for (b ∈ BPointsInAB) do
 6:    block ← C.index.locate(b)
 7:    block.isSafe ← false
 8: end for
 9: // Preprocess the blocks of C to determine the Contributing ones
10: contributingBlocks ← ∅
11: for (block ∈ C.index) do
12:    if (block.isSafe = false) then
13:       contributingBlocks.add(block)
14:    else
15:       nbr ← getkNN(block.center, k_{C-B})
16:       r ← distance(block.center, nbr.farthest)
17:       searchThreshold ← r + block.diagonal
18:       if (any block within searchThreshold is Candidate) then
19:          contributingBlocks.add(block)
20:       end if
21:    end if
22: end for
23: // Perform the join (C ⋈_kNN B) and intersect on B
24: outputTriplets ← ∅
25: for (block ∈ contributingBlocks) do
26:    for (c ∈ block) do
27:       nbr_c ← getkNN(c, k_{C-B}) // Neighborhood of c
28:       for ((a, b) ∈ ABpairs) do
29:          if (b ∈ nbr_c) then
30:             outputTriplets.add(a, b, c)
31:          end if
32:       end for
33:    end for
34: end for
35: return outputTriplets



A simple optimization for the preprocessing phase is to process 
only the Safe blocks. This is because a Candidate block is never 
marked Non-Contributing as its center is not contained in a Safe 
block (refer to the check in Line 12 of Procedure 4). 
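
The marking and preprocessing steps of Procedure 4 can be sketched in
Python as follows. This is a minimal illustration under stated
assumptions, not the authors' implementation: the index is assumed to be
a uniform grid of square cells of side size, neighborhoods are found by
brute force, and the helper names (knn, cell_of, min_dist,
contributing_cells) are hypothetical. A cell is treated as lying within
the search threshold when its minimum distance from the center of the
examined cell does not exceed the threshold.

```python
import math

def knn(p, points, k):
    """Return the k points nearest to p (brute force)."""
    return sorted(points, key=lambda q: math.dist(p, q))[:k]

def cell_of(p, size):
    """Grid cell (column, row) that contains point p."""
    return (int(p[0] // size), int(p[1] // size))

def min_dist(p, cell, size):
    """Smallest distance from point p to a square grid cell of side `size`."""
    x0, y0 = cell[0] * size, cell[1] * size
    dx = max(x0 - p[0], 0.0, p[0] - (x0 + size))
    dy = max(y0 - p[1], 0.0, p[1] - (y0 + size))
    return math.hypot(dx, dy)

def contributing_cells(c_cells, b_points, candidates, k, size):
    """Keep the cells of C whose search threshold reaches a Candidate cell."""
    diagonal = size * math.sqrt(2)
    kept = []
    for cell in c_cells:
        if cell in candidates:            # a Candidate cell always contributes
            kept.append(cell)
            continue
        center = ((cell[0] + 0.5) * size, (cell[1] + 0.5) * size)
        # Distance from the center to the farthest of its k neighbors in B.
        r = max(math.dist(center, b) for b in knn(center, b_points, k))
        threshold = r + diagonal
        if any(min_dist(center, cand, size) <= threshold for cand in candidates):
            kept.append(cell)
    return kept

# Toy data (assumed): the first join returned B points that all lie in cell (0, 0).
b_points = [(0.5, 0.5), (0.6, 0.5), (9.5, 9.5), (9.4, 9.5)]
candidates = {cell_of(b, 1.0) for b in [(0.5, 0.5), (0.6, 0.5)]}
print(contributing_cells([(0, 0), (1, 0), (9, 9)], b_points, candidates, k=1, size=1.0))
```

In this toy setup the far cell (9, 9) is pruned, because no Candidate
cell lies within its search threshold, while cells (0, 0) and (1, 0) are
kept for the second join.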



4.1.2 Join Order 

In the QEP of Figure 10, each kNN-join is evaluated independently.
Thus, changing the order of the two unchained kNN-joins leads to the
same results for the query. However, choosing which join to evaluate
first can affect the number of Candidate and Safe blocks, and hence
directly impacts the number of Non-Contributing (pruned) blocks in the
second join. Hence, the question: Which of the joins (A ⋈_kNN B) and
(C ⋈_kNN B) should be evaluated first?

Consider the case when the points in A and B are uniformly
distributed and cover the whole space, while the points in C are
clustered inside a certain region, say R. If we perform the join
(A ⋈_kNN B) first, there will be no Safe blocks because the
neighborhoods of the points of A will cover all the blocks in B due to
the uniformity in data distribution. This means that all the blocks of
C will be Contributing, i.e., no pruning will take place. On the other
hand, if we perform the join (C ⋈_kNN B) first, the Candidate blocks
will be only in Region R and its surroundings. This means that there
will be several Safe blocks. This will result in Non-Contributing
blocks in A that are pruned during the other join (A ⋈_kNN B).

In conclusion, considering A and C as the outer relations of two
unchained kNN-joins:

• If either A or C is clustered, the evaluation of the query
should start with the join of the clustered relation. As a
consequence, blocks of the inner relation (e.g., B) will have
a higher chance to be Safe. This would maximize the number
of Non-Contributing blocks in the outer relation of the
second join, and hence these blocks will be pruned.

• If both A and C are clustered, the evaluation of the query 
should start with the join of the relation that has less cluster 
coverage, i.e., the relation with clusters that cover smaller 
area. This increases the chance of pruning in the second join. 

• If both A and C are uniformly distributed, it is better to use 
the QEP of Figure 10, i.e., perform both joins independently. 
If Procedure 4 is applied, then there will be a preprocessing 
overhead (to mark the blocks) without payoff. The reason is 
that all the blocks of the outer relation of the second join will 
be Contributing, i.e., no pruning will occur. 
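
The three guidelines above can be sketched as a simple decision rule.
The sketch below is only one possible reading: the paper does not define
how cluster coverage is measured, so the sketch assumes it is
approximated by the number of grid cells a relation occupies, and it
assumes a hypothetical cutoff (uniform_fraction) above which a relation
is treated as uniformly distributed. The function names are likewise
hypothetical.

```python
def occupied_cells(points, size):
    """Set of grid cells (column, row) that contain at least one point."""
    return {(int(x // size), int(y // size)) for x, y in points}

def choose_plan(a_points, c_points, size, total_cells, uniform_fraction=0.9):
    """Suggest which outer relation to join with B first.

    Coverage is approximated by the number of occupied grid cells
    (an assumption of this sketch, not a definition from the paper).
    """
    cover_a = len(occupied_cells(a_points, size))
    cover_c = len(occupied_cells(c_points, size))
    # Both relations cover (almost) the whole space: marking would not pay off.
    if min(cover_a, cover_c) >= uniform_fraction * total_cells:
        return "independent"
    # Otherwise start with the relation that covers the smaller area.
    return "A first" if cover_a < cover_c else "C first"

# Toy data (assumed): A is spread over a 2 x 2 grid, C sits in a single cell.
a_points = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
c_points = [(0.1, 0.1), (0.2, 0.3)]
print(choose_plan(a_points, c_points, size=1.0, total_cells=4))
```

With these inputs the rule suggests starting with the join of C, the
relation with the smaller coverage, which leaves more Safe blocks for
pruning the join of A.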

In Section 6.2.1, we exploit various data distributions and cluster 
setups that demonstrate the effects depicted in the above cases. 

4.2 Chained kNN-Joins

Consider a query on three data sets, say A, B, and C. The query
is to retrieve the triplets (a, b, c), where a ∈ A, b ∈ B, and c ∈ C,
such that b is one of the k_{A-B} nearest neighbors of a, and c is one
of the k_{B-C} nearest neighbors of b. The query can be evaluated in a
variety of ways as Figure 13 illustrates. The three QEPs in the figure
produce the same results for the query, i.e., the following relation
holds [19]:

(A ⋈_kNN B) ∩_B (B ⋈_kNN C) =
(A ⋈_kNN B) ⋈_kNN C =
A ⋈_kNN (B ⋈_kNN C).

The correctness of the above relation can be explained as follows.
The join (A ⋈_kNN B) can be viewed as a selection on the outer
relation of the join (B ⋈_kNN C) (i.e., a selection on B). Similar to
the discussions in Section 3, pushing a selection on the outer
relation of a kNN-join does not affect the correctness of evaluation.
That is why performing the join (A ⋈_kNN B) before or after the join
(B ⋈_kNN C) leads to the same results.

Figure 13: k_{A-B} = k_{B-C} = 2. Three different QEPs for a query
with two chained kNN-joins. The three QEPs result in the same
triplets: (a1, b2, c1), (a1, b2, c2), (a2, b2, c1), (a2, b2, c2),
(a1, b3, c2), (a1, b3, c4), (a2, b3, c2), and (a2, b3, c4).
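
The equivalence of the plans can be checked on a small example. The
sketch below is illustrative only: it uses brute-force neighborhoods,
made-up coordinates, and hypothetical function names. It evaluates the
query once by joining independently and intersecting on B, and once by
nesting the second join under the first, and then compares the two
outputs.

```python
import math

def knn(p, points, k):
    """Return the k points nearest to p (brute force)."""
    return sorted(points, key=lambda q: math.dist(p, q))[:k]

def knn_join(outer, inner, k):
    """All pairs (o, i) such that i is among the k nearest neighbors of o."""
    return [(o, i) for o in outer for i in knn(o, inner, k)]

def join_intersection(A, B, C, k_ab, k_bc):
    """Evaluate both joins independently, then intersect the results on B."""
    ab = knn_join(A, B, k_ab)
    bc = knn_join(B, C, k_bc)
    return {(a, b, c) for (a, b) in ab for (b2, c) in bc if b == b2}

def nested_join(A, B, C, k_ab, k_bc):
    """Evaluate (A join B) first, then look up C only for the produced b."""
    return {(a, b, c)
            for (a, b) in knn_join(A, B, k_ab)
            for c in knn(b, C, k_bc)}

# Toy data (assumed). The point (10, 10) in B is far from every point in A.
A = [(0, 0), (1, 0)]
B = [(0, 1), (1, 1), (10, 10)]
C = [(0, 2), (1, 2), (10, 11), (11, 10)]

same = join_intersection(A, B, C, 2, 2) == nested_join(A, B, C, 2, 2)
print(same, len(nested_join(A, B, C, 2, 2)))
```

Both plans return the same eight triplets here. The intersection plan
still computes the neighborhood of (10, 10), although that point never
appears in the output.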



4.2.1 Efficient Evaluation

Although the three QEPs in Figure 13 produce the same results, 
they have different performance. The following points illustrate the 
pros and cons of each QEP. 

• QEP1 is a right-deep plan; the results of the join (B ⋈_kNN C)
have to be materialized before proceeding with the other join. This
is a major drawback, because no output can be produced until after
the join (B ⋈_kNN C) is complete. Moreover, performing the join
(B ⋈_kNN C) first implies that some redundant computations will be
performed, e.g., getting the neighborhood of b1 although it will
never appear in the results of the query as it is not in the
neighborhood of any point a ∈ A.

• QEP2 has an extra operator, ∩_B, to intersect the results of both
joins on B. Moreover, QEP2 suffers the same redundant computations
as QEP1, since QEP2 blindly computes the neighborhood of every point
b ∈ B regardless of whether or not b appears in the results of the
query.

• QEP3 avoids the redundant computations of QEP1 and QEP2. The
neighborhood of a point b ∈ B is computed only if b is produced as a
nearest neighbor to a point a ∈ A. Thus, computing the neighborhood
of b1 is avoided in this QEP. This results in remarkable performance
gains for QEP3 in comparison to QEP1 and QEP2, especially for
relations that have clusters of points. Clusters of points in B that
are not in the neighborhood of any point a ∈ A are pruned in the
joins of QEP3. However, both QEP1 and QEP2 will have to process all
the clusters. On the other hand, QEP3 suffers some repeated
computations. In particular, this happens for every point b that is
in the neighborhood of more than one point in A. For example,
computing the neighborhood of b2 is performed twice because b2
appears in the neighborhood of both a1 and a2. Similarly, the
neighborhood of b3 is computed twice.



To avoid the repeated computations in QEP3, we cache the results of
the join (B ⋈_kNN C) in a hash table, where b ∈ B is the key, and the
value is the neighborhood of b. Whenever a pair (a, b) is produced
from the join (A ⋈_kNN B), the hash table is probed to check if an
entry corresponding to b exists. If such an entry exists, the
neighborhood of b is retrieved from the hash table. Otherwise, the
neighborhood of b is computed and inserted into the hash table. As we
show in Section 6, caching the results of the join (B ⋈_kNN C)
significantly improves the performance of QEP3, which then
outperforms both QEP1 and QEP2.
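
The caching scheme can be sketched as follows. This is a minimal
illustration with brute-force neighborhoods and made-up data; the
function name nested_join_cached and the counter it returns are
hypothetical and serve only to show how many neighborhoods in C are
actually computed.

```python
import math

def knn(p, points, k):
    """Return the k points nearest to p (brute force)."""
    return sorted(points, key=lambda q: math.dist(p, q))[:k]

def nested_join_cached(A, B, C, k_ab, k_bc):
    """Nested chained kNN-join that caches the neighborhood of each b in C."""
    cache = {}        # key: point b of B, value: neighborhood of b in C
    computed = 0      # number of neighborhoods in C actually computed
    triplets = []
    for a in A:
        for b in knn(a, B, k_ab):
            if b not in cache:            # probe the hash table
                cache[b] = knn(b, C, k_bc)
                computed += 1
            triplets.extend((a, b, c) for c in cache[b])
    return triplets, computed

# Toy data (assumed): both points of A share the same two neighbors in B.
A = [(0, 0), (1, 0)]
B = [(0, 1), (1, 1), (10, 10)]
C = [(0, 2), (1, 2), (10, 11), (11, 10)]

triplets, computed = nested_join_cached(A, B, C, 2, 2)
print(len(triplets), computed)
```

Without the cache, this example would compute four neighborhoods in C
(two for each point of A); with the cache, only two are computed, and
the point (10, 10) is never processed.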

5. TWO KNN-SELECTS 

5.1 Correct Conceptual Evaluation 

When two kNN-select predicates are combined in a single query,
different QEPs that seem to be legitimate can produce different
results. The following example illustrates such ambiguity in the
evaluation of a query with two kNN-selects.

Assume that a person gets a new job in a city different from 
where he lives. He decides to move with his family to the new city, 
and considers buying a new house such that the new house is close 
to both his work and the school of his children. He wants to select 
candidate houses to choose from such that these houses are among 
the closest five houses to both his work and the school. 

Figures 14 and 15 give two different QEPs for the above query
with the corresponding resulting houses. In both figures, solid lines
indicate the kNN-select predicate performed first, and dashed lines
indicate the kNN-select predicate performed second.




Figure 14: A QEP with σ_{kNN,Work}(House) performed before
σ_{kNN,School}(House). The resulting houses are: x, y, l, m, and z.

Figure 15: A QEP with σ_{kNN,School}(House) performed before
σ_{kNN,Work}(House). The resulting houses are: x, y, n, p, and o.

Although the QEPs in Figures 14 and 15 seem legitimate, they
produce different results. Surprisingly, both results are wrong. The
reason is that when either of the two kNN-selects is performed first,
it filters the input of the other kNN-select. The scope of the
kNN-select performed last is limited to only the k points that qualify
under the first kNN-select. For example, in Figure 14,
σ_{kNN,School}(House) has nothing to select from except the five
houses that σ_{kNN,Work}(House) returns. Similarly, in Figure 15,
σ_{kNN,Work}(House) has nothing to select from except the five
houses that σ_{kNN,School}(House) returns.
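
The filtering effect is easy to reproduce. The sketch below uses six
hypothetical houses on a line, brute-force selection, and k = 3 (all
assumptions made for illustration). It compares the two sequential
orders with evaluating each predicate independently over all houses and
intersecting the results.

```python
import math

def knn_select(focal, points, k):
    """Return the k points nearest to the focal point (brute force)."""
    return sorted(points, key=lambda q: math.dist(focal, q))[:k]

# Toy data (assumed): houses on a line, work at x = 2, school at x = 3.
houses = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
work, school = (2, 0), (3, 0)
k = 3

# Sequential plans: the second select only sees the output of the first.
work_first = set(knn_select(school, knn_select(work, houses, k), k))
school_first = set(knn_select(work, knn_select(school, houses, k), k))

# Independent evaluation over all houses, followed by an intersection.
independent = set(knn_select(work, houses, k)) & set(knn_select(school, houses, k))

print(sorted(work_first), sorted(school_first), sorted(independent))
```

Each sequential plan simply returns the output of its first select,
because the second select has only k points to choose from. Only the
independent evaluation returns the houses that are among the three
closest to both focal points.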

According to [19], for the above query to be correctly evaluated,
each kNN-select predicate has to be evaluated independently. Then, the
results of applying both predicates are intersected. This is
illustrated in the QEP in Figure 16.

Figure 16: The correct QEP for a query with two kNN-select
predicates. Each predicate is evaluated independently, and the
results are intersected. The resulting houses are: x and y.

5.2 Efficient Evaluation

The QEP in Figure 16 for evaluating a query with two kNN-selects,
say σ_{k1,f1}(E) and σ_{k2,f2}(E), is efficient if k1 = k2. If
k1 ≠ k2, the above QEP suffers some redundancy as the following
discussion demonstrates.

Consider a query that has the two kNN-selects σ_{5,f1}(E) and
σ_{100,f2}(E), i.e., k1 = 5 and k2 = 100 (k2 is significantly greater
than k1). As mentioned in Section 2, in order to compute the
neighborhood of a point, say p, the locality is first determined.
Then, the points inside the blocks of the locality are processed in
order to get the closest k points to p. The standard approach (as in
[15]) of computing the locality is to keep adding blocks to the
locality until a total of k points is reached in the encountered
blocks. If this approach is applied to σ_{100,f2}(E), the locality of
f2 will be large and will cover almost the entire space in which the
points reside. In other words, almost all the blocks will be in the
locality of f2, and will have to be processed in order to find the
neighborhood of f2.

The above approach for computing the locality of f2 is not efficient
because it does not consider the neighborhood of f1. In particular,
the number of blocks in the locality of f2 can be smaller and still
produce correct results. This can be achieved by observing that it is
sufficient for the locality of f2 to completely include the
neighborhood of f1.

Because the final result of the query is determined by intersecting
the neighborhoods of f1 and f2, this final result cannot include
points other than those in the neighborhood of f1. Consequently, once
the neighborhood of f1 is determined, the locality of f2 can be
adjusted to cover just the neighborhood of f1. We define the search
threshold as the distance between f2 and the point farthest from it in
the neighborhood of f1. A block, say b, is added to the locality of f2
only if the MINDIST between b and f2 is less than or equal to the
search threshold. Refer to Figure 17 for illustration. This guarantees
that the neighborhood of f1 is included in the locality of f2 and, in
turn, guarantees the correctness of the final result of the query.

Figure 17: The search threshold is the distance between f2 and the
point farthest from it in the neighborhood of f1. The gray blocks
represent the locality of f2. A block is added to the locality of f2
if its MINDIST from f2 is less than or equal to the search threshold.



Procedure 5 2-kNN-select

Terms: nbr1, nbr2: The neighborhoods of f1 and f2, respectively.

 1: if k1 > k2 then
 2:     swap(k1, k2)
 3:     swap(f1, f2)
 4: end if
 5: nbr1 ← getkNN(f1, k1)
 6: searchThreshold ← distance(f2, nbr1.farthestToF2)
 7: f2.locality ← ∅
 8: count ← 0
 9: maxDistSoFar ← 0
10: // Process the blocks in MAXDIST order from f2
11: while count < k2 do
12:     block ← maxOrder.next()
13:     count ← count + block.numberOfPoints
14:     maxDistSoFar ← MAXDIST(block, f2)
15:     if MINDIST(block, f2) ≤ searchThreshold then
16:         f2.locality.add(block)
17:     end if
18: end while
19: // Process the remaining blocks in MINDIST order from f2
20: for (block ∈ minOrder) do
21:     if MINDIST(block, f2) ≤ maxDistSoFar then
22:         if MINDIST(block, f2) ≤ searchThreshold then
23:             f2.locality.add(block)
24:         else
25:             break
26:         end if
27:     else
28:         break
29:     end if
30: end for
31: // Determine the neighborhood of f2 from its locality
32: nbr2 ← getNeighborhood(f2, f2.locality)
33: return intersect(nbr1, nbr2)



Procedure 5 gives pseudocode for evaluating two kNN-select
predicates. The procedure starts by computing the neighborhood of f1,
i.e., evaluating the predicate with the smaller k. To compute the
neighborhood of f2, the locality of f2 is determined using a slightly
different version of the algorithm in [15]. In [15], to determine the
locality of a point, say p, the blocks of the index are processed in
increasing order of their MAXDIST from p, and are added to the
locality. The counts of the number of points in the blocks are summed
up until the total number of points in the encountered blocks reaches
or exceeds k. At this moment, the current value of the MAXDIST, say M,
is recorded. Afterwards, the remaining blocks are processed in
increasing order of their MINDIST from p, and are added to the
locality until a block whose MINDIST exceeds M is encountered. The
remaining blocks need not be examined. This procedure for building the
locality is proven to guarantee the optimal (minimum) possible number
of blocks [15]. We follow the same procedure for computing the
locality of f2 except that a block, say b, is added to the locality of
f2 only if the MINDIST between b and f2 is less than or equal to the
search threshold. Refer to Lines 15, 22, and 25 of Procedure 5. Notice
that in Line 25, scanning the blocks in MINDIST order stops when a
block of MINDIST greater than the search threshold is encountered.
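
The effect of the search threshold can be sketched in a few lines of
Python. The sketch is a simplification made for illustration: it
assumes a uniform grid of square cells of side size, finds neighborhoods
by brute force, and applies only the search-threshold test when building
the locality of f2; it omits the MAXDIST and MINDIST ordering of [15].
The names two_knn_select and min_dist are hypothetical, and the sketch
assumes no two points are equidistant from a focal point.

```python
import math

def knn(p, points, k):
    """Return the k points nearest to p (brute force)."""
    return sorted(points, key=lambda q: math.dist(p, q))[:k]

def min_dist(p, cell, size):
    """Smallest distance from point p to a square grid cell of side `size`."""
    x0, y0 = cell[0] * size, cell[1] * size
    dx = max(x0 - p[0], 0.0, p[0] - (x0 + size))
    dy = max(y0 - p[1], 0.0, p[1] - (y0 + size))
    return math.hypot(dx, dy)

def two_knn_select(points, f1, k1, f2, k2, size):
    """Intersect two kNN-selects; also report how many points f2 examined."""
    if k1 > k2:                       # evaluate the predicate with smaller k first
        f1, k1, f2, k2 = f2, k2, f1, k1
    nbr1 = knn(f1, points, k1)
    # Distance from f2 to the point of nbr1 that is farthest from f2.
    threshold = max(math.dist(f2, p) for p in nbr1)
    # Group the points into grid cells (the blocks of the index).
    blocks = {}
    for p in points:
        cell = (int(p[0] // size), int(p[1] // size))
        blocks.setdefault(cell, []).append(p)
    # Locality of f2: only blocks whose MINDIST from f2 is within the threshold.
    locality = [p for cell, pts in blocks.items()
                if min_dist(f2, cell, size) <= threshold
                for p in pts]
    nbr2 = knn(f2, locality, k2)
    return set(nbr1) & set(nbr2), len(locality)

# Toy data (assumed).
points = [(0.2, 0.2), (0.4, 0.3), (0.9, 0.1), (1.5, 0.5),
          (2.5, 0.5), (3.5, 0.5), (5.5, 5.5), (7.5, 7.5)]
result, examined = two_knn_select(points, (0.3, 0.2), 2, (1.0, 0.2), 3, 1.0)
print(result, examined)
```

In this example only four of the eight points fall inside the locality
of f2, yet the result equals the intersection of the two independently
computed neighborhoods.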

6. EXPERIMENTAL RESULTS 

In this section, we study the performance of the proposed op- 
timization techniques. We measure the query execution time. To 
compute the neighborhood of a point, we implement the locality 
algorithm as in [15]. All implementations are in Java. Experiments 
are conducted on a machine running Windows 7 with Intel Core2
Duo CPU at 2.1 GHz and 4 GB of main memory.

Our datasets are mainly generated using BerlinMOD [3], a
benchmark for spatio-temporal database management systems. The
data is downloadable through the BerlinMOD website [1] with 
scale-factor 1.0. In BerlinMOD, about two thousand cars report 
their movement over Berlin City for 28 days. We remove the time 
dimension from the data to deal with snapshots of points. Depend- 
ing on the kind of experiment, we vary the number of points in the 
datasets, from 32,000 to 2,560,000 data points. A sample snapshot 
of the data is given in Figure 18. In addition to the BerlinMOD 
data, and in order to demonstrate some specific effects, we gener- 
ate our own synthetic data. In particular, for some experiments, we 
generate clustered data and vary the number of clusters. 



Figure 18: A sample snapshot of BerlinMOD data plotted on the map of
Berlin City.

We index the data points into a simple grid. Since our algorithms
are independent of a specific indexing structure, we choose a grid
in order to be able to see the effectiveness of our algorithms even
with simple structures. We expect our algorithms to maintain the
same effectiveness (if not better) with more robust index
implementations, e.g., using variants of the R-tree or the quadtree.

6.1 kNN-Select with kNN-Join

In the following experiments, we study the performance of the two
proposed algorithms, Counting and Block-Marking, for a query with a
kNN-select on the inner relation of a kNN-join. Figure 19 illustrates
that the Block-Marking algorithm outperforms the conceptually correct
QEP by orders of magnitude. Blocks of points of the outer relation
that do not contribute to the results of the join are detected and are
excluded from the join operation. As the figure shows, increasing the
number of points in the outer relation emphasizes the pruning effects
of the algorithm.

Figure 19: Execution time of a query with a kNN-select on the inner
relation of a kNN-join. The Block-Marking algorithm outperforms the
conceptually correct evaluation plan by three orders of magnitude.
(x-axis: Outer Table Size in Thousands.)

Figures 20 and 21 compare the performance of the Counting and
Block-Marking algorithms. In Figure 20, the number of points in the
outer relation is lower than that in Figure 21. As the figures
demonstrate, when the number of points in the outer relation is small,
the Counting algorithm has better performance. In this case, the
density of the points is relatively low, and the overhead of the
preprocessing phase required by the Block-Marking algorithm is
relatively high because it requires computing the neighborhood of the
centers of many blocks without much payoff. On the other hand, when
the number of points in the outer relation is high, i.e., the outer
relation has high density, the Block-Marking algorithm has better
performance because entire blocks are excluded from the join. In
contrast, the Counting algorithm processes every point.

Figure 20: Execution time of a query with a kNN-select on the inner
relation of a kNN-join. The Counting algorithm has better performance
than the Block-Marking algorithm when the number of points in the
outer relation is low, and vice versa.

Figure 21: Execution time of a query with a kNN-select on the inner
relation of a kNN-join. The Block-Marking algorithm has much better
performance than the Counting algorithm when the number of points in
the outer relation is high.

6.2 Two kNN-Joins

6.2.1 Unchained kNN-Joins

In the following experiments, we study the performance of the
Block-Marking algorithm for a query with two unchained kNN-joins,
e.g., (A ⋈_kNN B) and (C ⋈_kNN B). As mentioned in Section 4.1.2, if
both A and C are uniformly distributed, then it is better to use the
conceptually correct QEP of Figure 10, i.e., perform both joins
independently, than to use the Block-Marking algorithm. In that case,
if the Block-Marking algorithm is applied, then there will be a
preprocessing overhead without payoff.

To demonstrate the pruning effects of the Block-Marking algorithm, we
have the following experimental setup. Points of B and C are generated
using BerlinMOD. Points of A are generated such that they are
clustered inside a certain region. We fix the number of points in A
and B, and vary the number of points in C. Figure 22 illustrates that
the Block-Marking algorithm can outperform the conceptually correct
QEP by an order of magnitude. As the figure demonstrates, the
Block-Marking algorithm has almost constant performance because it
detects the blocks of C that do not contribute to the results of the
query, and excludes them from the join (C ⋈_kNN B). However, the
conceptually correct QEP has to perform the join for all the points in
C regardless of the layout of the data.

Figure 22: Execution time of a query with two unchained kNN-joins
(A ⋈_kNN B) and (C ⋈_kNN B). B and C are uniformly distributed, and A
is clustered. The Block-Marking algorithm outperforms the conceptually
correct QEP by an order of magnitude.

If both A and C are clustered, then applying the Block-Marking
technique can also result in good performance gains. In this case,
the evaluation of the query should start with the join of the relation
that has less cluster coverage, i.e., the relation whose clusters
cover a smaller area. This gives a higher chance for pruning effects
in the second join.

To demonstrate this effect, we have the following experimental
setup. Points of B are generated using BerlinMOD. We generate
clusters of points in A and C. All the clusters have the same number
of points (4000), have the same area, and are non-overlapping. We vary
the number of clusters such that the number of clusters in A is
greater than the number of clusters in C by 1, 2, ..., 10. Figure 23
illustrates that starting the evaluation with (C ⋈_kNN B) results in
better performance than starting with (A ⋈_kNN B). If the evaluation
starts with (C ⋈_kNN B), the Block-Marking algorithm detects the
clusters of points in A that do not contribute to the results of the
query and excludes them from the join (A ⋈_kNN B). However, starting
with (A ⋈_kNN B) will fully compute the join for all the clusters
without exclusion.

Figure 23: Execution time of a query with two unchained kNN-joins
(A ⋈_kNN B) and (C ⋈_kNN B). A and C are clustered. The difference
between the number of clusters in A and C is varied; when the number
of clusters in C is smaller, starting with (C ⋈_kNN B) results in
better performance.



6.2.2 Chained kNN-Joins

In the following experiments, we study the performance of the
three QEPs of Figure 13, for a query with two chained kNN-joins,
e.g., (A ⋈_kNN B) and (B ⋈_kNN C). For ease of reference, we call
QEP3 the Nested Join, and QEP2 the Join Intersection.

As discussed in Section 4.2, there are two versions of the Nested
Join QEP: one that caches the results of the join (B ⋈_kNN C) in a
hash table to avoid repeating join computations, and another version
that does not do any caching. Figure 24 illustrates that caching the
results of the join (B ⋈_kNN C) significantly enhances the
performance.

Figure 24: Execution time of a query with two chained kNN-joins
(A ⋈_kNN B) and (B ⋈_kNN C). Caching the results of the join
(B ⋈_kNN C) significantly enhances the performance.

As discussed in Section 4.2, the Join Intersection QEP performs
the two joins (A ⋈_kNN B) and (B ⋈_kNN C) independently, and then
intersects their results on B (i.e., ∩_B). However, the Nested Join
QEP performs the join (B ⋈_kNN C) only for points b ∈ B that are in
the neighborhood of one or more points in A. When comparing the two
QEPs, we find that both plans have almost the same performance if the
data points are uniformly distributed. However, as Figure 25
demonstrates, for clustered data, the Nested Join QEP has better
performance. We use the version of the Nested Join QEP that caches the
results of the join (B ⋈_kNN C). As the number of clusters in B
increases, the Nested Join QEP outperforms the Join Intersection QEP.
This is because the Join Intersection QEP blindly does both joins
without any kind of pruning. However, clusters of points in B that are
not in the neighborhood of any point in A are pruned by the Nested
Join QEP.

Figure 25: Execution time of a query with two chained kNN-joins
(A ⋈_kNN B) and (B ⋈_kNN C). Performance when varying the number of
clusters in B.



6.3 Two kNN-Selects 

In the following experiment, we study the performance of the
2-kNN-select algorithm, for a query with two kNN-select predicates,
e.g., σ_{k1,f1}(E) and σ_{k2,f2}(E). Unlike the 2-kNN-select
algorithm, the conceptually correct QEP fully computes the two
kNN-selects and then intersects the results, i.e., does not leverage
the effect of doing one select and using its output to prune some of
the work of the other. In particular, this effect is leveraged by the
2-kNN-select algorithm when k1 and k2 have different values.

Figure 26 illustrates how the 2-kNN-select algorithm can outperform
the conceptually correct QEP by almost two orders of magnitude.
In this experiment, we fix k1 = 10 and vary k2. The x-axis of
the figure is log2(k2/k1). As the ratio k2/k1 increases, the
performance of the conceptually correct QEP degrades. The
2-kNN-select algorithm has almost constant performance, as it adjusts
the search threshold corresponding to the predicate of higher k value
to cover just the output of the predicate of lower k value.




Figure 26: Execution time of a query with two kNN-selects. The
2-kNN-select algorithm outperforms the conceptually correct QEP by
almost two orders of magnitude. (x-axis: log(k2/k1).)



7. CONCLUSIONS 

In this paper, we presented the first complete study for the
optimization of queries with two kNN predicates. We demonstrated
how traditional optimization techniques can compromise the
correctness of evaluation for a query that involves two interacting
kNN predicates. For different combinations of two kNN predicates, we
presented efficient algorithms that guarantee the correctness of
evaluation, and outperform the corresponding conceptually correct
QEPs by orders of magnitude.

The algorithms presented in this paper are designed for snapshot
queries. Applying further optimization techniques that can support
incremental evaluation of continuous queries with two kNN predicates
is potential future work. Moreover, we believe that the ideas
presented in this paper pave the way towards a query optimizer that
can support spatial queries with more than two kNN predicates.